Clustering is an unsupervised learning technique that you can use to determine which entities or objects, typically represented by rows in your data, are similar to each other. For text mining, you can use clustering to determine, for example, which documents are similar.
Let's start with a simple example of hierarchical clustering using numeric data. The three vectors below hold the city name, the rainfall amount, and the number of days with rainfall for a number of Canadian cities. Can we use hierarchical clustering to get a sense of which cities are similar to each other?
city=c("Montreal","Ottawa","Toronto","Quebec City", "Kingston", "Trois-Rivieres","Windsor","Hamilton","London","Halifax","Moncton","Saint John", "St. John's", "Sudbury", "Thunder Bay", "Winnipeg", "Saskatoon", "Regina", "Calgary", "Edmonton", "Kelowna", "Vancouver", "Victoria")
rainfall=c(1000,920,831,1184,960,1123,935,897,1012, 1468, 1124, 1295, 1534, 903, 684, 521,365,390, 456,419,345,1457,705)
days=c(163,161,145,175,159,161,150,149,168, 162, 161, 158, 212, 167, 143, 125, 87,118, 112,123,120,168,148)
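The code below starts by creating a data frame for the data. The two numeric columns are standardized with scale, and dist computes the pairwise distances between cities. The hclust function then creates the clustering information, which the plot function can use to draw a dendrogram.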
rain.data <- data.frame(city,rainfall,days)
dist.rain <- dist(scale(rain.data[,2:3]))
hc <- hclust(dist.rain) # hierarchical clustering built from the distance matrix
plot(hc,labels = rain.data$city)
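A dendrogram shows the full merge history, but it is often useful to turn it into explicit group labels. The sketch below repeats the rainfall example and then cuts the tree with cutree from base R. The choice of k = 3 groups is an assumption made purely for illustration; the text does not prescribe a number of clusters.

```r
# Same rainfall data as in the example above
city <- c("Montreal","Ottawa","Toronto","Quebec City", "Kingston", "Trois-Rivieres","Windsor","Hamilton","London","Halifax","Moncton","Saint John", "St. John's", "Sudbury", "Thunder Bay", "Winnipeg", "Saskatoon", "Regina", "Calgary", "Edmonton", "Kelowna", "Vancouver", "Victoria")
rainfall <- c(1000,920,831,1184,960,1123,935,897,1012, 1468, 1124, 1295, 1534, 903, 684, 521,365,390, 456,419,345,1457,705)
days <- c(163,161,145,175,159,161,150,149,168, 162, 161, 158, 212, 167, 143, 125, 87,118, 112,123,120,168,148)
rain.data <- data.frame(city, rainfall, days)

# Standardize both columns so that neither dominates the Euclidean distance
dist.rain <- dist(scale(rain.data[, 2:3]))
hc <- hclust(dist.rain)  # hclust uses complete linkage by default

# Cut the tree into k groups (k = 3 is an arbitrary, illustrative choice)
groups <- cutree(hc, k = 3)

# List the cities that fall into each group
print(split(rain.data$city, groups))
```

Trying a few values of k and comparing the groups against the dendrogram is a reasonable way to judge whether a given cut is meaningful.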
Now that we've seen a simple example of how hierarchical clustering works with numeric data, let's turn our attention back to text data, and the hockey dataset.
We'll start by getting the recaps of games written by the Associated Press (AP) and cleaning this data (removing stop words, etc.). Next, we'll create a term document matrix (tdm) from the cleaned text data.
At this point, we can switch over to clustering the data. We use the term document matrix to generate a distance matrix (dm). We might consider either document or term similarity in our distance matrix. In the initial example, we are looking at term similarity.
With the distance matrix in hand, we can then hierarchically cluster the data and generate a dendrogram for the clustering.
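To make the difference between term similarity and document similarity concrete, here is a minimal sketch on a toy matrix. The counts and the names toy_tdm, term_dist, and doc_dist are made up for illustration and are not part of the hockey dataset. Because dist compares rows, a term document matrix yields term distances directly, while its transpose yields document distances.

```r
# Toy term-document matrix (made-up counts): rows are terms, columns are documents
toy_tdm <- matrix(
  c(2, 0, 1,
    2, 0, 1,
    0, 3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("goal", "score", "save"), c("doc1", "doc2", "doc3"))
)

# dist() compares rows, so this gives term-to-term distances
term_dist <- dist(toy_tdm)

# Transposing first gives document-to-document distances instead
doc_dist <- dist(t(toy_tdm))

# Euclidean distance computed by hand for one pair of terms, for comparison
by_hand <- sqrt(sum((toy_tdm["goal", ] - toy_tdm["save", ])^2))

# Cluster the terms and draw the dendrogram
hc_terms <- hclust(term_dist)
plot(hc_terms)
```

In this toy matrix "goal" and "score" have identical counts, so their distance is zero and they merge first in the dendrogram.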
# load the text mining libraries
library('tm')
library('qdap')
Loading required package: NLP
Loading required package: qdapDictionaries
Loading required package: qdapRegex
Loading required package: qdapTools
Loading required package: RColorBrewer
Attaching package: ‘qdap’
The following objects are masked from ‘package:tm’:
as.DocumentTermMatrix, as.TermDocumentMatrix
The following object is masked from ‘package:NLP’:
ngrams
The following object is masked from ‘package:base’:
Filter
# Import text data
recaps <- read.csv(file="Data/Recap_data_first_pass_utf8.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)
# Isolate all of the text 'blobs' from the Associated Press recaps of the games: AP.recaps
AP.recaps <- recaps$AP.Recap
# Make a vector source: AP.recaps.source
AP.recaps.source <- VectorSource(AP.recaps)
# Make a volatile corpus: AP.recaps.corpus
AP.recaps.corpus <- VCorpus(AP.recaps.source)
# Create a customized function to clean the corpus
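clean_corpus_Sens <- function(corpus){
  corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stemDocument)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, stripWhitespace)
  # Caution: by this point the text is stemmed and lowercased, so the capitalized
  # entries "Ottawa" and "Senators" are unlikely to match anything. This would explain
  # why "ottawa" and "senat" still appear in the term document matrix below.
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"),"game", "first", "second", "third", "Ottawa", "Senators"))
  return(corpus)
}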
# Apply your customized function (clean_corpus_Sens, created above) to the AP.recaps.corpus
clean_corp.AP.recaps <- clean_corpus_Sens(AP.recaps.corpus)
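The order of cleaning steps matters when custom words are removed after lowercasing. The base-R sketch below illustrates the effect with a hypothetical helper, remove_words_sketch, which is only a stand-in for whole-word, case-sensitive removal and is not the tm implementation. The sample sentence is made up.

```r
# Hypothetical helper (not part of tm): delete whole-word, case-sensitive matches
remove_words_sketch <- function(text, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", text, perl = TRUE)
}

raw <- "Ottawa Senators beat Montreal in the third period"

# Lowercase first, then try to remove capitalized words: those words survive
order_a <- remove_words_sketch(tolower(raw), c("Ottawa", "Senators", "third"))

# Lowercase first, then remove lowercase words: the words are removed
order_b <- remove_words_sketch(tolower(raw), c("ottawa", "senators", "third"))

print(order_a)
print(order_b)
```

If a custom word list is meant to be applied after lowercasing (and stemming), it is safer to write the entries in the same lowercased (and stemmed) form as the text they are matched against.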
# Create the dtm from clean_corp.AP.recaps: AP.recaps_dtm
#AP.recaps_dtm <- DocumentTermMatrix(clean_corp.AP.recaps)
# Convert AP.recaps_dtm to a matrix: AP.recaps_dtm_m
#AP.recaps_dtm_m <- as.matrix(AP.recaps_dtm)
# Create a TDM from clean_corp.AP.recaps: AP.recaps_tdm
AP.recaps_tdm <- TermDocumentMatrix(clean_corp.AP.recaps)
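# Remove sparse terms. Note: despite the "_90" suffix in the name, this run uses sparse = 0.5,
# which keeps only terms that appear in more than half of the documents.
AP.recaps_tdm_90 <- removeSparseTerms(AP.recaps_tdm, sparse = .5)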
#AP.recaps_tdm_90 <- removeSparseTerms(AP.recaps_tdm,sparse=.9)
#AP.recaps_tdm_95 <- removeSparseTerms(AP.recaps_tdm,sparse=.95)
#AP.recaps_tdm_99 <- removeSparseTerms(AP.recaps_tdm,sparse=.99)
# What effect did this have on the number of terms?
AP.recaps_tdm_90
#AP.recaps_tdm_95
#AP.recaps_tdm_99
# Convert AP.recaps_tdm_90 to a matrix: AP.recaps_tdm_90_m
AP.recaps_tdm_90_m <- as.matrix(AP.recaps_tdm_90)
#AP.recaps_tdm_95_m <- as.matrix(AP.recaps_tdm_95)
#AP.recaps_tdm_99_m <- as.matrix(AP.recaps_tdm_99)
<<TermDocumentMatrix (terms: 82, documents: 101)>>
Non-/sparse entries: 5806/2476
Sparsity           : 30%
Maximal term length: 8
Weighting          : term frequency (tf)
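To get a feel for what the sparse threshold does, the sketch below applies the same kind of rule to a toy matrix in base R. It assumes that a term is kept when the share of documents in which it is absent is below the threshold, which is my reading of how removeSparseTerms behaves. The helper keep_terms_sketch and the counts are hypothetical.

```r
# Toy term-document matrix (made-up counts): 3 terms, 4 documents
toy_tdm <- matrix(
  c(1, 2, 1, 3,   # "goal" appears in 4 of 4 documents
    0, 1, 0, 2,   # "save" appears in 2 of 4 documents
    0, 0, 0, 1),  # "tie"  appears in 1 of 4 documents
  nrow = 3, byrow = TRUE,
  dimnames = list(c("goal", "save", "tie"), paste0("doc", 1:4))
)

# Share of documents in which each term is absent
term_sparsity <- rowMeans(toy_tdm == 0)
print(term_sparsity)

# Hypothetical helper: keep terms that are absent from fewer than
# a fraction `sparse` of the documents
keep_terms_sketch <- function(m, sparse) {
  m[rowMeans(m == 0) < sparse, , drop = FALSE]
}

kept_50 <- rownames(keep_terms_sketch(toy_tdm, 0.5))
kept_90 <- rownames(keep_terms_sketch(toy_tdm, 0.9))
print(kept_50)
print(kept_90)
```

A lower threshold is stricter: fewer terms survive, which keeps the distance matrix and the dendrogram small enough to read.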
# Save the term document matrix as a data frame
TDM_90.df <- as.data.frame(AP.recaps_tdm_90_m)
#TDM_95.df <- as.data.frame(AP.recaps_tdm_95_m)
#TDM_99.df <- as.data.frame(AP.recaps_tdm_99_m)
TDM_90.df
| term | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ⋯ | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| also | 2 | 0 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | ⋯ | 4 | 1 | 1 | 1 | 0 | 2 | 0 | 1 | 2 | 0 |
| anderson | 3 | 3 | 0 | 2 | 3 | 5 | 1 | 13 | 6 | 1 | ⋯ | 4 | 3 | 5 | 4 | 5 | 2 | 2 | 2 | 10 | 6 |
| assist | 1 | 0 | 1 | 2 | 0 | 2 | 2 | 0 | 2 | 0 | ⋯ | 4 | 1 | 2 | 1 | 1 | 0 | 2 | 4 | 0 | 2 |
| back | 0 | 3 | 0 | 1 | 0 | 0 | 2 | 1 | 1 | 0 | ⋯ | 4 | 5 | 1 | 1 | 3 | 2 | 1 | 1 | 4 | 1 |
| beat | 1 | 3 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 0 | ⋯ | 2 | 1 | 2 | 0 | 0 | 3 | 1 | 3 | 2 | 0 |
| befor | 1 | 0 | 1 | 4 | 0 | 3 | 2 | 0 | 0 | 1 | ⋯ | 1 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 1 |
| boucher | 2 | 2 | 0 | 0 | 3 | 2 | 0 | 0 | 2 | 1 | ⋯ | 4 | 3 | 0 | 4 | 1 | 2 | 1 | 3 | 3 | 1 |
| came | 2 | 1 | 0 | 1 | 2 | 2 | 0 | 1 | 0 | 3 | ⋯ | 2 | 2 | 1 | 1 | 2 | 2 | 0 | 1 | 1 | 1 |
| chanc | 0 | 2 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 2 | ⋯ | 2 | 0 | 5 | 2 | 0 | 2 | 0 | 1 | 1 | 0 |
| coach | 4 | 3 | 0 | 1 | 2 | 1 | 1 | 0 | 3 | 5 | ⋯ | 2 | 2 | 1 | 2 | 1 | 3 | 2 | 2 | 2 | 2 |
| come | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 2 | ⋯ | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| craig | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ⋯ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| didnt | 1 | 2 | 0 | 1 | 2 | 2 | 1 | 2 | 0 | 0 | ⋯ | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 5 | 1 |
| end | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | ⋯ | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 0 | 0 | 1 |
| erik | 2 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 2 | ⋯ | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 2 |
| final | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 1 | ⋯ | 2 | 2 | 4 | 4 | 3 | 2 | 1 | 6 | 3 | 5 |
| five | 1 | 0 | 0 | 1 | 0 | 1 | 3 | 0 | 1 | 2 | ⋯ | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| four | 2 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 2 | 1 | ⋯ | 3 | 1 | 0 | 2 | 0 | 2 | 1 | 3 | 2 | 3 |
| gave | 0 | 1 | 1 | 2 | 0 | 0 | 2 | 1 | 0 | 0 | ⋯ | 0 | 2 | 0 | 2 | 0 | 1 | 1 | 0 | 2 | 0 |
| get | 3 | 2 | 0 | 3 | 2 | 3 | 1 | 0 | 1 | 2 | ⋯ | 2 | 3 | 2 | 5 | 3 | 3 | 3 | 2 | 4 | 2 |
| give | 1 | 1 | 0 | 4 | 1 | 0 | 0 | 1 | 0 | 0 | ⋯ | 0 | 1 | 0 | 3 | 1 | 2 | 0 | 2 | 0 | 2 |
| goal | 9 | 2 | 9 | 17 | 4 | 4 | 6 | 1 | 3 | 4 | ⋯ | 8 | 13 | 5 | 5 | 1 | 13 | 5 | 4 | 5 | 4 |
| good | 1 | 0 | 0 | 2 | 1 | 4 | 1 | 0 | 3 | 3 | ⋯ | 3 | 1 | 2 | 2 | 1 | 0 | 5 | 3 | 1 | 0 |
| got | 1 | 2 | 0 | 3 | 5 | 0 | 1 | 1 | 0 | 1 | ⋯ | 4 | 2 | 5 | 1 | 0 | 4 | 2 | 1 | 2 | 1 |
| great | 1 | 3 | 0 | 2 | 0 | 3 | 0 | 2 | 0 | 0 | ⋯ | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| guy | 4 | 1 | 0 | 1 | 2 | 3 | 2 | 0 | 2 | 1 | ⋯ | 1 | 4 | 2 | 2 | 2 | 1 | 2 | 1 | 1 | 3 |
| hoffman | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 4 | 0 | 2 | ⋯ | 0 | 2 | 2 | 0 | 0 | 6 | 1 | 0 | 3 | 0 |
| host | 2 | 1 | 1 | 2 | 0 | 1 | 1 | 0 | 1 | 1 | ⋯ | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| injuri | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 2 | 3 | ⋯ | 3 | 0 | 0 | 0 | 1 | 2 | 2 | 0 | 0 | 4 |
| just | 1 | 2 | 0 | 6 | 2 | 8 | 6 | 1 | 1 | 3 | ⋯ | 3 | 4 | 1 | 6 | 4 | 3 | 4 | 2 | 4 | 4 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| period | 5 | 4 | 4 | 6 | 3 | 3 | 1 | 2 | 3 | 3 | ⋯ | 5 | 1 | 6 | 3 | 4 | 4 | 4 | 4 | 5 | 1 |
| play | 5 | 2 | 3 | 4 | 10 | 7 | 6 | 2 | 5 | 7 | ⋯ | 7 | 2 | 5 | 7 | 3 | 4 | 5 | 9 | 5 | 3 |
| point | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 0 | 4 | 1 | ⋯ | 1 | 1 | 0 | 3 | 1 | 2 | 3 | 1 | 5 | 1 |
| power | 0 | 0 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | ⋯ | 0 | 0 | 2 | 2 | 1 | 0 | 1 | 2 | 1 | 0 |
| puck | 2 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 2 | ⋯ | 5 | 4 | 1 | 4 | 2 | 2 | 2 | 0 | 2 | 1 |
| put | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | ⋯ | 0 | 1 | 3 | 2 | 0 | 1 | 0 | 1 | 4 | 1 |
| right | 2 | 0 | 0 | 0 | 3 | 2 | 0 | 0 | 2 | 1 | ⋯ | 2 | 1 | 3 | 1 | 2 | 1 | 6 | 0 | 0 | 4 |
| ryan | 1 | 1 | 1 | 1 | 0 | 6 | 0 | 2 | 1 | 0 | ⋯ | 1 | 1 | 0 | 5 | 0 | 2 | 4 | 0 | 5 | 1 |
| said | 7 | 5 | 0 | 4 | 6 | 8 | 5 | 5 | 3 | 5 | ⋯ | 9 | 8 | 4 | 6 | 5 | 7 | 7 | 6 | 5 | 5 |
| saturday | 2 | 2 | 0 | 2 | 1 | 1 | 0 | 0 | 1 | 3 | ⋯ | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| save | 1 | 1 | 1 | 3 | 2 | 3 | 0 | 4 | 1 | 3 | ⋯ | 2 | 1 | 3 | 2 | 1 | 2 | 5 | 0 | 2 | 1 |
| score | 9 | 8 | 3 | 11 | 4 | 2 | 5 | 3 | 4 | 2 | ⋯ | 11 | 8 | 5 | 3 | 0 | 4 | 3 | 3 | 6 | 3 |
| scratch | 2 | 2 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 2 | ⋯ | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 1 |
| season | 3 | 1 | 7 | 8 | 4 | 6 | 2 | 5 | 0 | 2 | ⋯ | 1 | 0 | 0 | 1 | 2 | 4 | 0 | 0 | 0 | 0 |
| senat | 5 | 7 | 4 | 16 | 9 | 4 | 5 | 4 | 9 | 10 | ⋯ | 7 | 5 | 10 | 10 | 3 | 5 | 4 | 5 | 6 | 6 |
| shot | 4 | 3 | 1 | 8 | 1 | 4 | 9 | 0 | 4 | 2 | ⋯ | 4 | 9 | 6 | 4 | 7 | 5 | 5 | 1 | 5 | 7 |
| start | 0 | 2 | 2 | 2 | 2 | 2 | 5 | 6 | 4 | 1 | ⋯ | 0 | 1 | 0 | 0 | 2 | 2 | 3 | 1 | 1 | 0 |
| stop | 1 | 1 | 1 | 2 | 1 | 3 | 2 | 1 | 2 | 1 | ⋯ | 2 | 1 | 3 | 0 | 2 | 1 | 3 | 1 | 5 | 1 |
| straight | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| team | 0 | 1 | 1 | 0 | 2 | 2 | 3 | 3 | 5 | 2 | ⋯ | 3 | 1 | 1 | 1 | 2 | 5 | 2 | 3 | 0 | 8 |
| think | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ⋯ | 2 | 0 | 1 | 0 | 2 | 2 | 4 | 2 | 2 | 1 |
| three | 2 | 0 | 2 | 3 | 0 | 3 | 5 | 1 | 0 | 0 | ⋯ | 0 | 1 | 1 | 0 | 1 | 2 | 3 | 3 | 2 | 0 |
| thursday | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 1 | 2 | ⋯ | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| tie | 1 | 3 | 0 | 0 | 0 | 0 | 3 | 0 | 2 | 0 | ⋯ | 3 | 9 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| time | 0 | 2 | 2 | 0 | 4 | 3 | 0 | 0 | 2 | 2 | ⋯ | 2 | 3 | 3 | 2 | 6 | 0 | 3 | 4 | 1 | 2 |
| tuesday | 0 | 1 | 1 | 4 | 2 | 2 | 1 | 2 | 2 | 0 | ⋯ | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| two | 1 | 2 | 3 | 4 | 0 | 4 | 4 | 3 | 0 | 2 | ⋯ | 9 | 5 | 2 | 2 | 2 | 1 | 6 | 3 | 4 | 3 |
| way | 1 | 3 | 0 | 0 | 1 | 2 | 1 | 1 | 2 | 1 | ⋯ | 0 | 2 | 1 | 2 | 2 | 0 | 1 | 1 | 1 | 5 |
| went | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | ⋯ | 1 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 0 | 0 |
| win | 1 | 4 | 0 | 2 | 0 | 0 | 2 | 2 | 1 | 0 | ⋯ | 3 | 6 | 0 | 1 | 2 | 3 | 0 | 1 | 1 | 1 |
# Compute distance matrices
dist_90 = dist(TDM_90.df)
dist_90
#dist_95 = dist(TDM_95.df)
#dist_99 = dist(TDM_99.df)
#Note here that the distance matrix is extremely large! Keep scrolling...
also anderson assist back beat befor boucher
anderson 28.160256
assist 14.798649 30.594117
back 15.874508 26.095977 18.947295
beat 15.524175 27.313001 18.000000 15.066519
befor 12.529964 30.331502 16.613248 17.464249 14.142136
boucher 14.247807 27.676705 16.431677 16.583124 15.362291 16.613248
came 14.387495 28.879058 16.792856 16.643317 16.552945 13.341664 16.000000
chanc 16.431677 29.342802 19.000000 17.888544 14.525839 14.247807 16.583124
coach 15.394804 27.928480 17.549929 16.340135 14.282857 16.431677 11.489125
come 13.266499 29.103264 17.000000 15.811388 14.177447 12.609520 13.892444
craig 12.369317 26.343880 16.370706 14.525839 12.806248 9.695360 14.142136
didnt 14.071247 27.982137 18.027756 15.427249 14.798649 14.730920 16.703293
end 13.747727 29.291637 16.062378 17.291616 15.748016 14.000000 15.748016
erik 12.369317 28.600699 15.874508 16.278821 12.961481 10.677078 14.282857
final 15.905974 28.284271 17.378147 17.691806 16.852300 15.427249 15.874508
five 13.856406 31.701735 15.394804 17.549929 15.394804 11.445523 16.703293
four 14.798649 29.257478 16.431677 18.248288 16.186414 15.231546 16.431677
gave 13.711309 29.342802 16.340135 17.549929 15.198684 11.618950 15.842980
get 21.725561 27.331301 23.086793 21.307276 19.723083 22.605309 19.874607
give 14.491377 30.149627 17.635192 17.320508 15.716234 12.688578 16.340135
goal 56.973678 56.577381 55.614746 55.910643 55.434646 59.974995 56.894639
good 20.049938 30.708305 20.952327 22.226111 18.303005 19.773720 19.364917
got 19.000000 29.732137 21.213203 19.104973 17.378147 21.447611 19.949937
great 14.456832 29.120440 18.973666 16.941074 14.560220 12.727922 16.248077
guy 15.779734 26.832816 16.613248 17.058722 15.684387 17.320508 13.266499
hoffman 17.776389 30.380915 20.273135 19.949937 18.193405 18.083141 20.024984
host 13.820275 30.133038 16.309506 16.401219 13.564660 9.899495 15.556349
injuri 14.730920 29.189039 16.492423 17.804494 16.370706 14.422205 14.764823
just 26.664583 28.248894 27.166155 25.436195 25.922963 27.748874 26.907248
karlsson 15.874508 28.266588 19.364917 19.078784 17.000000 18.193405 18.303005
kyle 12.727922 29.883106 16.155494 16.248077 14.035669 10.440307 15.329710
last 14.832397 29.715316 17.175564 17.262677 16.278821 14.933185 17.406895
lead 19.339080 28.896367 20.223748 18.973666 19.773720 21.424285 21.000000
left 18.027756 28.248894 18.761663 19.261360 18.165902 18.439089 18.330303
like 13.892444 28.879058 17.549929 17.349352 13.564660 12.165525 14.422205
made 20.976177 26.324893 23.043437 22.090722 19.621417 23.130067 20.904545
make 14.247807 29.765752 16.852300 16.941074 14.560220 11.747340 15.362291
mark 15.033296 28.583212 17.804494 16.062378 14.106736 14.525839 15.905974
mike 13.527749 30.983867 15.556349 17.058722 13.711309 13.341664 15.620499
minut 17.521415 28.319605 20.736441 16.522712 14.966630 18.055470 18.920888
miss 14.282857 29.376862 17.058722 17.146428 16.522712 13.453624 16.031220
net 17.406895 27.276363 20.099751 18.248288 17.888544 17.492856 19.798990
next 12.041595 28.000000 14.899664 15.588457 12.328828 10.862780 13.711309
nhl 14.387495 29.966648 16.792856 18.027756 15.684387 12.000000 16.792856
night 22.472205 27.820855 24.166092 22.825424 20.199010 23.832751 22.583180
notes 11.269428 28.000000 14.899664 14.525839 11.916375 9.273618 13.564660
one 16.643317 28.809721 19.390719 17.521415 16.970563 18.165902 15.620499
open 12.369317 27.856777 18.055470 16.093477 13.784049 11.045361 14.352700
ottawa 43.680659 37.775654 44.215382 42.332021 42.673177 48.135226 41.267421
pass 13.416408 28.231188 18.083141 16.673332 14.387495 12.922848 15.968719
past 13.266499 28.089144 16.093477 16.792856 15.588457 13.964240 16.217275
period 35.327043 35.171011 34.856850 33.793490 31.511903 36.864617 35.142567
play 47.434165 44.056782 47.444705 45.825757 44.034078 49.729267 46.292548
point 21.679483 30.149627 22.869193 22.494444 22.561028 24.758837 20.663978
power 16.000000 29.715316 17.058722 16.792856 15.066519 13.892444 17.748239
puck 17.435596 27.221315 20.518285 17.606817 18.520259 18.894444 19.570386
put 13.856406 28.231188 17.691806 15.362291 14.106736 12.767145 15.968719
right 16.881943 27.531800 18.330303 18.027756 16.792856 17.320508 17.378147
ryan 18.681542 28.354894 21.260292 18.466185 17.204651 17.832555 17.663522
said 41.749251 37.040518 42.308392 38.923001 38.961519 44.766059 38.755645
saturday 13.711309 30.740852 18.947295 18.330303 14.177447 12.449900 16.340135
save 18.894444 26.495283 22.360680 19.773720 16.552945 19.748418 19.798990
score 44.079474 42.988371 44.788391 42.272923 41.424630 47.958315 43.335897
scratch 13.114877 29.883106 18.027756 17.606817 13.228757 12.449900 16.763055
season 24.939928 32.726136 25.079872 26.907248 24.718414 26.095977 26.172505
senat 65.207362 59.690870 66.113539 63.796552 62.377881 68.242216 64.614240
shot 36.193922 34.741906 36.728735 34.842503 34.073450 38.742741 37.722672
start 16.155494 26.305893 17.606817 19.209373 17.029386 15.937377 17.888544
stop 14.352700 26.019224 17.291616 17.832555 14.662878 14.106736 15.905974
straight 15.459625 30.463092 16.613248 17.860571 16.248077 14.071247 16.186414
team 21.142375 29.017236 22.759613 22.068076 22.000000 22.583180 20.396078
think 16.431677 29.949958 19.000000 17.720045 15.968719 16.462078 17.578396
three 16.093477 29.427878 16.248077 19.672316 16.309506 15.362291 17.549929
thursday 13.379088 30.298515 16.309506 17.406895 15.362291 11.832160 15.937377
tie 18.411953 30.430248 20.049938 16.941074 18.547237 18.920888 18.654758
time 19.000000 29.325757 20.396078 18.466185 19.442222 19.442222 18.547237
tuesday 12.806248 28.442925 17.349352 16.431677 15.132746 11.704700 16.340135
two 22.405357 28.195744 24.596748 23.237900 22.605309 25.475478 24.677925
way 18.708287 28.965497 21.236761 17.944358 16.643317 18.193405 19.519221
went 13.076697 29.631065 16.124515 17.117243 15.099669 11.832160 16.000000
win 18.000000 27.766887 20.808652 17.492856 17.406895 20.074860 19.313208
came chanc coach come craig didnt end
anderson
assist
back
beat
befor
boucher
came
chanc 14.594520
coach 15.491933 16.031220
come 13.601471 13.928388 14.177447
craig 10.392305 11.357817 14.696938 9.848858
didnt 13.674794 15.874508 16.822604 14.000000 11.874342
end 13.190906 15.066519 16.552945 13.453624 10.295630 15.264338
erik 11.401754 12.041595 14.422205 10.440307 7.071068 12.845233 10.862780
final 15.491933 16.155494 17.492856 14.730920 12.727922 16.462078 15.033296
five 13.000000 13.711309 15.968719 11.401754 11.180340 15.362291 13.453624
four 15.033296 16.278821 15.937377 15.524175 13.784049 15.716234 16.370706
gave 13.674794 14.832397 16.643317 13.038405 10.908712 15.427249 13.674794
get 22.293497 23.021729 17.691806 22.181073 21.794495 23.108440 22.693611
give 13.601471 15.684387 15.716234 14.422205 10.630146 14.212670 13.228757
goal 59.891569 61.188234 54.027771 60.909769 60.950800 60.199668 60.885138
good 20.223748 17.663522 17.058722 19.235384 18.894444 20.248457 20.174241
got 20.880613 21.236761 18.814888 21.000000 20.099751 21.236761 20.297783
great 14.764823 15.000000 16.733201 11.618950 10.677078 14.525839 14.282857
guy 17.204651 17.406895 13.490738 16.462078 15.099669 18.574176 15.874508
hoffman 18.248288 19.131126 19.570386 17.378147 16.583124 18.654758 17.521415
host 12.000000 13.527749 14.764823 11.180340 8.944272 14.247807 12.409674
injuri 14.071247 14.933185 15.033296 12.845233 12.000000 17.578396 12.247449
just 28.530685 29.512709 24.331050 28.124722 27.712813 27.838822 30.033315
karlsson 17.578396 18.000000 18.574176 17.663522 15.779734 18.708287 17.804494
kyle 11.958261 12.165525 16.217275 10.677078 7.549834 13.856406 10.723805
last 16.093477 17.832555 14.662878 15.620499 14.106736 17.549929 16.093477
lead 21.142375 23.874673 19.261360 21.213203 20.808652 21.071308 21.189620
left 17.029386 17.521415 17.088007 18.681542 16.613248 18.734994 18.654758
like 14.000000 14.317821 14.352700 13.601471 10.392305 15.842980 14.352700
made 22.338308 22.715633 19.052559 21.023796 20.904545 22.449944 22.248595
make 13.190906 15.000000 14.764823 12.288206 10.198039 14.730920 13.490738
mark 15.000000 14.966630 14.456832 13.856406 11.958261 16.248077 13.747727
mike 15.362291 15.394804 15.033296 13.820275 13.341664 15.264338 14.422205
minut 18.654758 18.788294 16.673332 18.027756 16.970563 16.763055 18.547237
miss 14.247807 14.422205 16.881943 13.416408 10.630146 16.186414 12.288206
net 17.492856 18.947295 20.000000 19.104973 16.370706 19.570386 18.330303
next 12.409674 12.845233 14.142136 9.848858 8.717798 13.304135 11.575837
nhl 13.038405 15.459625 16.309506 12.529964 10.000000 15.066519 12.961481
night 25.337719 23.769729 19.646883 23.473389 23.151674 23.769729 24.207437
notes 10.770330 11.180340 13.638182 9.219544 5.830952 12.288206 9.591663
one 16.792856 18.520259 15.362291 17.175564 16.124515 17.860571 18.110770
open 12.489996 14.525839 14.764823 12.922848 9.486833 13.601471 13.038405
ottawa 46.829478 46.518813 39.812058 46.173586 46.743984 46.346521 47.042534
pass 12.529964 13.038405 16.340135 12.961481 10.148892 12.884099 13.820275
past 13.527749 13.856406 16.093477 12.328828 10.908712 15.033296 13.892444
period 36.124784 36.959437 32.634338 38.314488 37.349699 36.359318 38.249183
play 50.309045 49.739320 44.215382 50.299105 49.789557 48.887626 51.078371
point 25.903668 25.729361 21.610183 23.409400 23.769729 24.166092 24.392622
power 15.842980 16.000000 17.691806 14.764823 12.688578 15.556349 15.132746
puck 17.635192 19.949937 19.000000 18.220867 16.941074 19.493589 18.411953
put 13.453624 13.856406 16.093477 12.328828 9.539392 12.569805 13.601471
right 16.552945 17.916473 17.320508 16.703293 14.142136 17.916473 16.370706
ryan 17.435596 18.083141 17.146428 17.860571 15.556349 17.058722 18.000000
said 43.680659 44.147480 36.083237 43.988635 44.362146 44.056782 45.011110
saturday 14.730920 14.422205 16.155494 12.884099 10.816654 16.248077 13.820275
save 20.591260 20.615528 18.275667 19.723083 18.654758 20.223748 20.199010
score 47.581509 47.947888 40.274061 47.233463 47.958315 46.335731 48.290786
scratch 14.456832 14.422205 16.401219 12.884099 11.269428 14.212670 14.247807
season 27.513633 28.844410 24.103942 27.239677 27.000000 28.071338 27.258026
senat 69.159237 68.161573 61.457302 67.911707 69.101375 68.278840 69.462220
shot 39.382737 38.729833 35.028560 39.370039 38.561639 38.314488 39.255573
start 16.970563 17.748239 16.852300 16.278821 14.899664 17.406895 15.937377
stop 13.892444 14.491377 15.524175 14.282857 11.618950 14.212670 14.525839
straight 13.711309 14.798649 16.431677 13.527749 11.747340 16.763055 13.784049
team 24.000000 22.781571 18.708287 21.236761 21.633308 23.216374 21.863211
think 16.401219 16.613248 16.401219 15.033296 14.730920 15.620499 15.779734
three 16.970563 18.574176 16.673332 16.703293 14.899664 17.916473 16.792856
thursday 13.266499 14.662878 16.000000 11.789826 10.583005 14.317821 12.328828
tie 19.235384 20.124612 17.378147 17.860571 17.832555 19.621417 19.131126
time 18.439089 18.947295 16.792856 18.947295 17.029386 18.411953 19.493589
tuesday 13.964240 15.556349 17.406895 12.884099 10.816654 13.416408 13.304135
two 25.903668 26.645825 23.769729 26.076810 24.959968 26.191602 26.134269
way 19.104973 18.973666 17.804494 16.186414 15.588457 16.370706 17.860571
went 14.071247 14.456832 16.370706 13.076697 9.486833 15.000000 11.747340
win 20.712315 20.736441 18.788294 19.026298 18.627936 19.849433 20.024984
erik final five four gave get give
anderson
assist
back
beat
befor
boucher
came
chanc
coach
come
craig
didnt
end
erik
final 13.564660
five 11.357817 15.968719
four 13.416408 16.431677 15.000000
gave 11.000000 16.401219 12.649111 15.968719
get 22.516660 23.515952 23.452079 22.693611 23.622024
give 11.789826 15.000000 15.099669 15.716234 13.190906 22.271057
goal 60.588778 60.506198 60.811183 57.210139 60.249481 50.774009 59.816386
good 18.841444 22.427661 19.026298 20.469489 21.725561 20.880613 20.000000
got 19.544820 21.166010 21.977261 21.494185 21.702534 22.248595 21.095023
great 11.045361 15.684387 14.387495 17.262677 13.964240 22.912878 14.177447
guy 15.684387 18.055470 16.941074 18.654758 15.968719 18.411953 17.464249
hoffman 16.278821 18.894444 17.832555 20.074860 17.146428 24.413111 17.262677
host 8.717798 15.874508 10.908712 13.928388 11.000000 22.605309 12.369317
injuri 11.832160 15.620499 13.076697 16.673332 14.866069 21.840330 16.155494
just 28.284271 28.913665 29.748950 28.000000 27.622455 24.718414 28.792360
karlsson 12.609520 18.841444 18.708287 17.521415 18.920888 23.874673 18.654758
kyle 7.937254 14.035669 10.295630 13.379088 10.000000 23.874673 12.569805
last 14.035669 16.278821 14.764823 16.093477 17.549929 20.736441 15.491933
lead 20.904545 23.130067 22.090722 21.142375 19.646883 21.023796 20.976177
left 16.431677 18.439089 18.627936 18.165902 18.627936 19.974984 19.416488
like 11.135529 15.748016 13.000000 15.099669 13.000000 22.068076 13.892444
made 21.330729 24.310492 22.759613 22.561028 23.790755 22.135944 23.151674
make 10.583005 15.427249 12.288206 15.491933 13.527749 23.473389 13.892444
mark 11.789826 16.703293 14.966630 15.842980 13.638182 21.908902 14.899664
mike 12.247449 15.165751 13.000000 15.231546 14.106736 21.610183 13.747727
minut 17.088007 17.435596 18.138357 18.330303 19.519221 21.377558 18.681542
miss 10.816654 15.329710 13.038405 16.401219 13.266499 23.194827 14.491377
net 16.492423 20.199010 19.209373 16.911535 17.916473 21.470911 19.052559
next 8.944272 14.212670 10.908712 13.711309 11.532563 21.047565 12.529964
nhl 10.583005 15.033296 12.449900 15.165751 13.964240 23.259407 12.922848
night 23.237900 25.495098 23.937418 23.237900 25.475478 22.022716 23.727621
notes 6.633250 12.727922 9.848858 12.884099 10.344080 21.047565 10.440307
one 17.204651 17.204651 18.303005 18.814888 17.691806 19.000000 18.357560
open 10.198039 14.696938 13.820275 15.748016 12.609520 22.203603 12.922848
ottawa 47.148701 45.771170 47.937459 45.836667 47.770284 36.606010 47.749346
pass 9.848858 14.798649 13.190906 14.525839 14.142136 23.452079 14.142136
past 10.148892 15.394804 12.806248 14.594520 12.727922 22.135944 14.491377
period 37.775654 36.619667 37.894591 35.651087 38.366652 28.530685 37.549967
play 50.269275 49.628621 50.852729 48.610698 51.166395 40.074930 49.678969
point 23.216374 25.119713 24.413111 22.605309 23.874673 24.698178 25.729361
power 14.035669 16.583124 15.099669 16.340135 14.696938 23.537205 15.099669
puck 17.578396 20.273135 18.439089 18.574176 19.078784 20.149442 19.131126
put 9.848858 13.892444 13.416408 14.730920 12.806248 22.583180 13.341664
right 14.352700 18.110770 16.703293 15.684387 18.248288 21.283797 17.972201
ryan 17.320508 18.708287 18.520259 18.867962 17.578396 22.516660 16.340135
said 44.519659 43.794977 45.044423 42.614552 45.088801 31.288976 44.485953
saturday 10.246951 16.703293 12.409674 15.905974 14.000000 22.715633 13.856406
save 19.646883 21.771541 19.974984 21.447611 21.189620 21.142375 20.712315
score 47.770284 48.207883 49.081565 45.978256 48.280431 38.196859 46.786750
scratch 11.000000 15.968719 13.416408 15.394804 14.422205 22.135944 13.490738
season 26.776856 28.618176 26.907248 26.134269 27.313001 25.651511 27.784888
senat 69.072426 68.212902 69.598851 66.324958 68.760454 56.160484 67.970582
shot 38.923001 38.509739 39.115214 37.269290 39.824616 31.336879 39.572718
start 15.811388 18.814888 16.522712 19.849433 15.842980 22.022716 16.941074
stop 12.449900 15.132746 15.033296 14.177447 15.297059 20.099751 16.370706
straight 11.661904 16.492423 12.609520 16.431677 14.387495 21.189620 14.662878
team 22.271057 22.226111 23.685439 23.021729 23.515952 24.310492 22.516660
think 14.247807 17.406895 16.000000 17.860571 16.248077 22.494444 17.029386
three 15.165751 18.275667 15.588457 16.852300 15.459625 21.377558 15.968719
thursday 10.099505 15.684387 12.449900 15.099669 13.820275 23.388031 13.747727
tie 18.110770 21.679483 18.520259 19.798990 17.406895 24.020824 19.621417
time 18.165902 17.776389 18.947295 20.199010 20.518285 21.931712 18.303005
tuesday 11.357817 16.093477 14.142136 16.401219 13.416408 23.452079 13.114877
two 25.000000 24.879711 26.115130 25.159491 25.690465 24.372115 27.202941
way 16.583124 19.519221 18.439089 20.074860 19.748418 22.405357 18.000000
went 10.488088 14.832397 12.609520 15.811388 12.206556 23.259407 13.304135
win 18.788294 20.124612 19.899749 20.273135 18.814888 21.447611 19.849433
goal good got great guy hoffman host
anderson
assist
back
beat
befor
boucher
came
chanc
coach
come
craig
didnt
end
erik
final
five
four
gave
get
give
goal
good 56.462377
got 52.469038 22.737634
great 61.587336 19.209373 23.021729
guy 54.927225 18.788294 19.339080 16.613248
hoffman 56.885851 22.847319 20.174241 19.000000 20.518285
host 60.390397 19.261360 20.297783 11.661904 15.811388 17.117243
injuri 59.455866 19.874607 20.000000 15.811388 15.427249 17.916473 12.727922
just 47.423623 26.925824 25.495098 28.354894 24.454039 29.748950 28.390139
karlsson 56.515485 20.639767 19.672316 17.916473 18.357560 21.071308 17.117243
kyle 61.546730 20.639767 20.174241 12.609520 16.217275 16.970563 8.426150
last 56.444663 20.346990 19.874607 17.058722 16.763055 20.099751 14.730920
lead 49.030603 23.323808 21.189620 22.338308 18.681542 21.494185 21.424285
left 54.451814 20.566964 19.287302 19.078784 17.029386 22.113344 18.384776
like 60.638272 20.518285 21.166010 13.564660 15.620499 18.627936 10.954451
made 53.591044 19.287302 20.952327 21.931712 20.952327 23.958297 21.840330
make 60.704201 19.364917 21.071308 13.490738 16.248077 17.521415 11.135529
mark 57.792733 19.849433 19.723083 13.453624 15.329710 17.720045 12.041595
mike 56.347138 17.691806 17.663522 15.231546 15.684387 14.662878 12.409674
minut 54.064776 19.570386 18.110770 19.235384 17.606817 19.974984 18.708287
miss 61.481705 21.260292 20.273135 15.000000 15.779734 17.944358 12.041595
net 56.258333 21.563859 20.736441 19.595918 19.595918 20.952327 16.431677
next 58.711157 17.804494 18.867962 11.313708 14.628739 16.155494 8.000000
nhl 61.229078 20.856654 22.494444 12.649111 16.186414 19.313208 11.224972
night 49.568135 21.702534 21.908902 24.207437 20.832667 24.310492 22.405357
notes 59.573484 18.083141 18.708287 10.295630 14.560220 16.401219 7.071068
one 55.398556 19.157244 19.339080 18.275667 16.552945 21.330729 17.492856
open 59.924953 19.723083 19.849433 12.649111 16.124515 17.349352 11.135529
ottawa 38.858718 41.352146 39.331921 46.914816 39.433488 45.497253 47.675990
pass 59.933296 19.798990 19.672316 12.369317 17.804494 18.330303 11.874342
past 59.682493 20.099751 19.621417 13.892444 17.117243 19.390719 12.369317
period 43.034870 33.585711 31.575307 37.696154 32.542280 36.878178 37.536649
play 43.405069 43.405069 40.853396 50.901866 43.714986 48.476799 50.566788
point 54.277067 25.258662 24.062419 24.839485 22.113344 25.651511 23.643181
power 58.702640 20.445048 20.024984 16.822604 18.520259 20.000000 13.747727
puck 53.981478 21.494185 19.924859 19.570386 17.804494 21.307276 17.748239
put 60.909769 20.639767 20.124612 12.449900 17.578396 16.733201 10.630146
right 56.204982 20.124612 19.390719 17.204651 17.492856 20.273135 16.124515
ryan 57.714816 19.874607 21.400935 18.110770 17.606817 19.261360 17.606817
said 36.041643 38.196859 36.469165 44.407207 36.769553 43.577517 44.497191
saturday 60.663004 19.078784 20.371549 13.152946 16.941074 18.493242 11.090537
save 54.138711 18.841444 18.165902 20.099751 19.078784 21.748563 19.131126
score 29.883106 42.438190 38.884444 48.062459 41.737274 46.119410 47.644517
scratch 60.348985 17.549929 20.712315 12.767145 17.000000 18.654758 11.269428
season 49.091751 27.202941 25.317978 27.073973 24.879711 28.460499 25.903668
senat 41.665333 61.351447 59.388551 67.801180 61.911227 66.332496 68.490875
shot 39.293765 34.698703 32.109189 39.937451 34.132096 37.947332 39.686270
start 57.991379 19.672316 22.627417 16.673332 17.146428 18.466185 16.062378
stop 57.393379 18.439089 18.894444 15.968719 17.058722 18.814888 14.106736
straight 59.607047 21.142375 22.000000 15.033296 16.552945 18.083141 12.569805
team 56.080300 22.605309 24.124676 23.537205 20.396078 24.718414 22.090722
think 59.816386 19.493589 19.570386 15.842980 17.748239 19.949937 15.066519
three 55.145263 20.712315 21.354157 17.776389 16.309506 19.672316 15.099669
thursday 60.819405 18.947295 21.166010 12.884099 18.330303 16.217275 9.695360
tie 54.470175 22.293497 22.181073 19.748418 18.000000 21.563859 18.165902
time 57.576037 19.261360 21.540659 18.493242 18.601075 21.563859 18.920888
tuesday 59.899917 20.832667 20.760539 12.609520 16.822604 18.973666 10.535654
two 48.435524 25.139610 23.937418 26.400758 23.259407 26.683328 25.865034
way 59.581876 21.494185 24.515301 16.217275 19.261360 21.400935 17.175564
went 60.967204 19.052559 20.542639 13.114877 16.792856 18.027756 12.083046
win 54.644304 23.065125 21.095023 18.947295 18.627936 20.591260 18.841444
injuri just karlsson kyle last lead left
anderson
assist
back
beat
befor
boucher
came
chanc
coach
come
craig
didnt
end
erik
final
five
four
gave
get
give
goal
good
got
great
guy
hoffman
host
injuri
just 29.563491
karlsson 16.703293 27.331301
kyle 12.922848 29.376862 16.613248
last 14.866069 27.404379 18.275667 15.620499
lead 21.702534 23.600847 21.679483 22.181073 21.166010
left 16.733201 26.305893 17.860571 17.748239 17.521415 20.566964
like 14.212670 28.809721 17.691806 11.874342 15.588457 22.516660 19.339080
made 21.977261 25.592968 22.671568 22.494444 22.494444 23.366643 22.158520
make 13.341664 27.856777 17.521415 11.874342 14.035669 21.931712 18.867962
mark 14.035669 26.776856 17.088007 13.038405 15.874508 20.099751 18.574176
mike 14.352700 26.720778 17.691806 12.845233 14.106736 19.874607 17.663522
minut 19.339080 25.139610 18.788294 18.466185 18.411953 18.681542 19.183326
miss 11.445523 29.512709 17.378147 10.488088 15.099669 22.271057 17.860571
net 19.078784 25.258662 18.027756 17.349352 20.371549 22.338308 19.235384
next 12.083046 26.758176 16.522712 9.000000 13.453624 19.723083 17.262677
nhl 14.966630 29.086079 17.804494 10.344080 15.459625 23.259407 19.442222
night 23.664319 26.153394 23.600847 24.062419 23.000000 23.811762 24.124676
notes 11.575837 27.349589 14.933185 6.708204 13.152946 20.074860 16.000000
one 18.867962 24.535688 19.974984 17.691806 17.291616 20.371549 18.330303
open 14.696938 27.531800 16.462078 10.908712 15.394804 21.610183 18.654758
ottawa 45.265881 32.480764 42.848571 48.641546 43.772137 37.841776 40.632499
pass 14.798649 28.687977 15.684387 11.135529 16.309506 21.679483 15.779734
past 14.177447 27.367864 15.231546 11.313708 15.165751 22.090722 16.583124
period 37.161808 30.512293 35.580894 38.496753 36.606010 32.771939 32.756679
play 49.889879 37.749172 46.754679 51.380930 47.138095 43.863424 46.227697
point 22.956481 28.407745 22.627417 23.323808 24.166092 23.874673 24.718414
power 17.117243 27.802878 18.330303 13.038405 18.110770 21.213203 20.273135
puck 19.261360 23.937418 18.547237 17.663522 19.899749 20.542639 18.411953
put 15.329710 29.274562 17.204651 10.295630 15.362291 22.181073 16.881943
right 16.062378 26.907248 17.175564 14.662878 17.175564 21.424285 16.124515
ryan 17.492856 27.313001 21.931712 17.175564 18.303005 21.517435 18.601075
said 42.544095 30.626786 40.681691 46.076024 40.187063 36.290495 37.656341
saturday 14.317821 29.512709 16.792856 10.583005 16.124515 23.194827 19.104973
save 19.849433 26.000000 21.748563 19.773720 20.174241 22.427661 20.591260
score 47.053161 35.916570 43.485630 49.244289 44.799554 38.013156 43.127717
scratch 15.716234 28.372522 17.606817 12.489996 16.370706 22.405357 19.974984
season 26.362853 27.294688 27.820855 27.055499 22.671568 25.099801 25.592968
senat 67.860150 52.545219 65.099923 70.554943 65.360539 58.634461 63.442888
shot 38.327536 28.301943 34.583233 39.949969 36.986484 30.594117 33.421550
start 16.000000 26.832816 20.760539 16.703293 17.804494 21.142375 19.748418
stop 14.456832 26.589472 16.000000 12.409674 15.811388 20.591260 16.155494
straight 12.649111 29.698485 17.860571 11.874342 15.000000 22.649503 18.439089
team 20.639767 25.884358 24.186773 22.869193 21.236761 24.637370 24.248711
think 16.881943 27.440845 18.439089 14.764823 16.431677 22.360680 17.635192
three 17.320508 27.202941 18.734994 15.524175 16.822604 19.000000 20.000000
thursday 13.564660 29.698485 17.406895 10.908712 15.588457 22.912878 19.595918
tie 18.654758 25.139610 19.974984 17.691806 21.000000 20.273135 19.183326
time 19.235384 26.720778 20.469489 19.313208 18.788294 22.472205 19.748418
tuesday 15.132746 27.910571 17.435596 10.954451 16.062378 20.542639 18.894444
two 24.758837 23.000000 21.771541 26.000000 24.248711 23.108440 22.781571
way 19.261360 28.513155 21.118712 18.165902 18.654758 23.832751 21.047565
went 14.071247 28.035692 16.583124 9.746794 16.462078 21.283797 17.088007
win 20.566964 25.278449 21.260292 18.814888 20.149442 20.880613 20.663978
like made make mark mike minut miss
anderson
assist
back
beat
befor
boucher
came
chanc
coach
come
craig
didnt
end
erik
final
five
four
gave
get
give
goal
good
got
great
guy
hoffman
host
injuri
just
karlsson
kyle
last
lead
left
like
made 22.516660
make 12.961481 21.886069
mark 13.674794 21.118712 13.892444
mike 14.142136 21.047565 13.564660 14.247807
minut 18.493242 21.283797 18.547237 17.291616 15.427249
miss 13.076697 22.315914 13.152946 13.638182 14.247807 18.894444
net 18.165902 21.470911 18.920888 16.763055 18.330303 20.976177 17.972201
next 10.954451 19.261360 11.401754 12.124356 10.954451 16.673332 11.445523
nhl 12.884099 23.130067 13.341664 15.394804 14.491377 19.442222 13.892444
night 23.494680 20.273135 24.289916 22.693611 21.118712 21.400935 24.799194
notes 10.198039 20.074860 10.488088 10.908712 11.224972 16.431677 9.746794
one 17.663522 21.283797 17.832555 18.574176 17.262677 18.439089 18.841444
open 11.489125 21.142375 11.575837 12.767145 13.564660 16.733201 13.674794
ottawa 46.615448 38.444766 46.872167 44.766059 44.170126 40.755368 46.925473
pass 13.228757 21.725561 13.453624 13.266499 14.456832 17.464249 14.000000
past 13.527749 21.354157 13.964240 13.711309 14.177447 18.627936 12.806248
period 36.207734 31.016125 37.986840 35.355339 34.336569 31.638584 38.444766
play 49.203658 40.841156 50.129831 48.826222 47.000000 43.485630 50.059964
point 22.516660 24.657656 24.062419 22.803509 22.561028 24.677925 22.181073
power 16.703293 22.315914 15.588457 16.124515 16.217275 18.734994 15.874508
puck 18.248288 19.748418 19.519221 18.055470 18.083141 19.416488 18.165902
put 12.845233 22.671568 13.453624 13.711309 13.304135 17.916473 13.190906
right 16.492423 17.349352 15.684387 15.588457 15.748016 18.493242 16.583124
ryan 17.776389 20.904545 16.062378 18.193405 16.733201 19.798990 17.117243
said 43.726422 35.142567 44.090815 41.484937 41.279535 37.947332 44.754888
saturday 12.288206 21.260292 14.035669 14.696938 12.767145 18.788294 13.266499
save 19.646883 11.789826 18.867962 18.841444 17.320508 18.654758 20.174241
score 47.560488 40.410395 48.187135 45.022217 44.474712 40.620192 48.897853
scratch 12.688578 20.346990 13.379088 13.784049 12.449900 16.703293 14.899664
season 27.258026 26.153394 26.664583 26.457513 24.919872 26.514147 26.305893
senat 67.668309 57.393379 69.173694 65.238026 64.474801 61.229078 69.137544
shot 39.560081 33.286634 39.408121 38.105118 36.262929 31.953091 40.447497
start 16.309506 20.615528 15.937377 16.822604 15.874508 19.026298 17.117243
stop 15.394804 20.297783 14.456832 16.186414 14.525839 15.968719 15.165751
straight 12.727922 23.086793 13.564660 13.820275 14.422205 20.149442 11.000000
team 22.934690 23.685439 21.166010 20.760539 21.447611 22.494444 22.158520
think 17.406895 22.045408 15.459625 15.874508 15.066519 17.691806 17.262677
three 16.186414 23.216374 16.492423 17.804494 15.874508 18.165902 17.117243
thursday 13.114877 21.470911 12.569805 13.379088 13.711309 19.798990 13.304135
tie 19.748418 25.317978 19.748418 17.578396 18.547237 18.547237 19.313208
time 18.761663 23.130067 18.000000 17.972201 17.378147 18.000000 19.416488
tuesday 14.317821 21.023796 13.892444 14.491377 14.933185 18.303005 13.266499
two 26.627054 24.372115 25.475478 26.115130 23.388031 20.712315 25.298221
way 17.464249 22.090722 16.583124 17.776389 18.681542 19.874607 18.973666
went 13.928388 21.886069 13.490738 12.767145 13.564660 17.378147 13.152946
win 18.947295 22.315914 20.124612 18.493242 18.681542 18.894444 19.595918
net next nhl night notes one open
anderson
assist
back
beat
befor
boucher
came
chanc
coach
come
craig
didnt
end
erik
final
five
four
gave
get
give
goal
good
got
great
guy
hoffman
host
injuri
just
karlsson
kyle
last
lead
left
like
made
make
mark
mike
minut
miss
net
next 16.970563
nhl 19.544820 11.224972
night 24.289916 20.396078 24.041631
notes 15.556349 6.324555 10.099505 21.213203
one 19.899749 16.000000 18.330303 23.748684 15.811388
open 16.970563 11.135529 12.649111 23.151674 9.380832 17.262677
ottawa 44.440972 44.283180 48.052055 35.256205 45.967380 40.112342 46.249324
pass 16.881943 12.041595 13.892444 24.919872 9.848858 18.574176 11.532563
past 16.822604 11.704700 13.820275 24.228083 9.949874 17.233688 13.076697
period 32.664966 35.846897 38.223030 28.478062 36.069378 33.660065 36.207734
play 46.679760 47.275787 50.566788 38.144462 48.897853 44.305756 48.979588
point 24.145393 21.931712 25.475478 24.351591 22.248595 22.737634 24.020824
power 18.466185 13.601471 15.132746 23.853721 12.041595 17.691806 14.525839
puck 14.594520 16.941074 19.313208 22.605309 16.155494 18.627936 17.233688
put 16.217275 11.000000 13.379088 24.433583 9.110434 17.058722 11.445523
right 17.888544 14.212670 16.124515 22.405357 13.784049 18.220867 15.231546
ryan 21.725561 16.186414 18.110770 23.874673 15.937377 18.493242 16.124515
said 40.570926 42.142615 45.210618 32.649655 43.197222 37.202150 43.611925
saturday 18.193405 9.949874 11.958261 22.869193 9.643651 17.916473 13.000000
save 19.697716 16.852300 21.118712 19.235384 17.720045 19.026298 18.330303
score 44.090815 45.607017 48.846699 35.014283 46.497312 42.473521 47.180504
scratch 17.058722 10.535654 13.601471 21.748563 9.848858 19.209373 12.206556
season 29.171904 23.430749 26.551836 24.718414 25.357445 27.221315 27.549955
senat 63.647467 65.901442 69.992857 54.046276 67.372101 62.681736 67.933791
shot 35.199432 37.349699 39.812058 30.182777 37.960506 34.770677 38.716921
start 20.542639 14.282857 17.435596 21.447611 14.352700 19.949937 15.033296
stop 17.804494 12.369317 15.066519 21.886069 11.704700 15.716234 13.379088
straight 18.973666 11.224972 13.564660 23.958297 10.677078 18.439089 14.142136
team 24.083189 20.445048 23.537205 21.354157 20.736441 22.627417 22.583180
think 19.467922 15.132746 17.748239 23.600847 13.601471 16.881943 15.329710
three 21.633308 14.142136 15.811388 21.354157 14.212670 18.330303 15.937377
thursday 16.733201 10.862780 13.928388 23.452079 9.165151 18.110770 12.328828
tie 21.118712 17.663522 19.595918 23.958297 16.911535 19.442222 19.287302
time 21.447611 16.733201 19.748418 22.715633 16.370706 18.000000 17.888544
tuesday 18.841444 10.440307 13.304135 22.649503 9.746794 18.357560 12.609520
two 23.853721 24.433583 26.814175 24.269322 24.186773 23.000000 24.839485
way 20.663978 16.340135 17.972201 23.558438 15.329710 19.672316 17.464249
went 17.320508 11.747340 12.961481 24.819347 9.380832 16.733201 11.916375
win 20.952327 17.233688 20.566964 21.840330 17.464249 20.663978 19.621417
ottawa pass past period play point power
anderson
assist
back
beat
befor
boucher
came
chanc
coach
come
craig
didnt
end
erik
final
five
four
gave
get
give
goal
good
got
great
guy
hoffman
host
injuri
just
karlsson
kyle
last
lead
left
like
made
make
mark
mike
minut
miss
net
next
nhl
night
notes
one
open
ottawa
pass 46.882833
past 46.021734 11.661904
period 35.185224 36.660606 37.202150
play 31.176915 50.398413 50.338852 33.763886
point 39.648455 25.884358 22.934690 33.734256 45.166359
power 46.021734 15.620499 15.620499 36.578682 46.260134 25.019992
puck 40.865633 17.663522 16.492423 31.432467 45.099889 23.237900 20.099751
put 47.539457 11.313708 12.083046 37.229021 50.970580 23.874673 15.099669
right 42.860238 15.198684 15.842980 33.630343 46.400431 21.656408 17.635192
ryan 43.023250 17.804494 17.972201 35.256205 47.634021 24.433583 19.672316
said 22.825424 44.170126 43.874822 28.442925 29.342802 38.091994 43.943145
saturday 47.201695 14.071247 13.114877 37.282704 49.030603 24.454039 15.937377
save 39.281039 19.416488 19.824228 30.083218 41.267421 23.130067 20.273135
score 29.647934 47.339202 47.738873 30.282008 32.695565 41.916584 46.249324
scratch 46.626173 12.961481 14.212670 35.383612 47.560488 26.000000 15.297059
season 38.078866 27.784888 26.267851 36.027767 42.000000 28.071338 27.349589
senat 40.249224 68.381284 67.896981 45.232732 37.523326 60.514461 66.947741
shot 30.528675 39.698866 38.626416 28.705400 34.205263 33.436507 37.013511
start 43.783559 17.635192 17.578396 35.114100 47.717921 24.758837 19.000000
stop 43.794977 14.696938 14.000000 34.117444 47.265209 21.213203 15.684387
straight 46.378875 14.387495 12.206556 36.810325 50.249378 23.302360 16.941074
team 40.187063 24.310492 24.515301 34.856850 43.943145 22.383029 23.130067
think 45.934736 15.937377 14.696938 35.468296 48.867167 23.706539 18.493242
three 43.439613 17.578396 16.941074 34.307434 45.967380 22.649503 16.278821
thursday 48.114447 13.000000 13.304135 37.722672 51.156622 23.515952 15.716234
tie 41.581246 19.824228 18.734994 34.828150 47.613023 22.516660 20.124612
time 41.892720 20.124612 18.681542 33.749074 44.911023 24.515301 20.223748
tuesday 46.432747 12.727922 14.142136 37.309516 49.152823 23.579652 15.165751
two 34.785054 25.534291 24.535688 30.561414 41.158231 25.690465 26.191602
way 44.676616 18.867962 18.814888 36.523965 48.145612 24.576411 19.899749
went 47.233463 13.076697 12.688578 37.669616 50.507425 23.727621 14.035669
win 40.249224 20.346990 19.339080 32.526912 45.321077 21.633308 21.540659
puck put right ryan said saturday save
anderson
assist
back
beat
befor
boucher
came
chanc
coach
come
craig
didnt
end
erik
final
five
four
gave
get
give
goal
good
got
great
guy
hoffman
host
injuri
just
karlsson
kyle
last
lead
left
like
made
make
mark
mike
minut
miss
net
next
nhl
night
notes
one
open
ottawa
pass
past
period
play
point
power
puck
put 18.055470
right 16.583124 16.278821
ryan 18.947295 16.941074 18.275667
said 36.837481 44.732538 39.673669 40.298883
saturday 17.663522 13.490738 15.968719 18.520259 44.305756
save 17.916473 20.074860 16.613248 18.493242 35.608988 18.574176
score 41.653331 47.717921 44.833024 45.585085 24.779023 47.444705 40.620192
scratch 17.888544 13.190906 17.233688 19.104973 43.462628 11.575837 18.138357
season 28.600699 28.035692 27.184554 26.814175 36.455452 26.343880 25.942244
senat 61.886994 69.195376 64.691576 65.612499 36.345564 67.527772 58.489315
shot 32.954514 39.572718 34.307434 37.000000 28.053520 38.600518 32.572995
start 20.808652 16.941074 18.275667 18.439089 41.206796 18.193405 19.390719
stop 18.220867 13.416408 14.662878 16.583124 41.048752 14.422205 18.138357
straight 18.411953 14.106736 16.492423 18.547237 43.886217 13.453624 21.213203
team 24.269322 23.473389 22.315914 24.248711 36.742346 24.269322 22.934690
think 18.165902 14.899664 17.578396 18.193405 41.460825 16.124515 19.570386
three 20.469489 17.175564 19.078784 19.646883 40.620192 17.635192 21.354157
thursday 18.681542 11.532563 16.431677 18.110770 45.343136 12.288206 18.973666
tie 18.411953 18.357560 19.899749 21.118712 39.140772 20.074860 23.151674
time 20.760539 18.466185 19.026298 20.880613 39.166312 18.627936 21.213203
tuesday 19.078784 13.038405 17.406895 17.916473 44.124823 14.491377 19.000000
two 21.863211 26.191602 24.269322 25.709920 32.664966 25.612497 23.000000
way 20.099751 16.673332 18.841444 20.024984 42.414620 17.088007 20.760539
went 18.357560 11.789826 15.231546 17.776389 44.654227 12.845233 19.026298
win 18.220867 19.131126 19.313208 21.794495 37.509999 18.708287 20.760539
score scratch season senat shot start stop
anderson
assist
back
beat
befor
boucher
came
chanc
coach
come
craig
didnt
end
erik
final
five
four
gave
get
give
goal
good
got
great
guy
hoffman
host
injuri
just
karlsson
kyle
last
lead
left
like
made
make
mark
mike
minut
miss
net
next
nhl
night
notes
one
open
ottawa
pass
past
period
play
point
power
puck
put
right
ryan
said
saturday
save
score
scratch 46.227697
season 41.097445 27.495454
senat 36.455452 65.984847 58.000000
shot 30.643107 38.314488 35.721142 45.077711
start 45.431267 16.522712 25.670995 64.876806 37.134889
stop 44.866469 15.748016 26.000000 65.650590 33.882149 16.822604
straight 47.623524 15.066519 26.589472 68.942005 40.435133 16.792856 15.842980
team 41.133928 22.248595 27.694765 59.991666 35.623026 20.000000 22.338308
think 46.141088 16.370706 29.051678 66.693328 37.175261 18.894444 14.491377
three 43.726422 17.804494 24.186773 65.306967 36.235342 17.320508 15.264338
thursday 48.332184 12.688578 27.440845 68.767725 40.187063 16.492423 14.247807
tie 40.841156 20.074860 28.442925 64.054664 34.423829 20.976177 19.209373
time 44.429720 17.972201 28.757608 64.007812 34.885527 20.396078 17.406895
tuesday 47.212287 12.961481 24.819347 67.334983 39.242834 15.588457 15.033296
two 36.728735 25.495098 27.129320 56.991227 29.120440 24.799194 22.405357
way 45.354162 17.378147 30.000000 65.696271 37.175261 20.371549 18.000000
went 48.352870 13.820275 28.124722 69.318107 38.301436 17.088007 12.922848
win 41.581246 18.761663 25.059928 61.562976 32.771939 19.924859 19.183326
straight team think three thursday tie time
anderson
assist
back
beat
befor
boucher
came
chanc
coach
come
craig
didnt
end
erik
final
five
four
gave
get
give
goal
good
got
great
guy
hoffman
host
injuri
just
karlsson
kyle
last
lead
left
like
made
make
mark
mike
minut
miss
net
next
nhl
night
notes
one
open
ottawa
pass
past
period
play
point
power
puck
put
right
ryan
said
saturday
save
score
scratch
season
senat
shot
start
stop
straight
team 23.494680
think 17.406895 22.158520
three 15.362291 22.803509 18.947295
thursday 13.928388 22.978251 15.524175 18.165902
tie 19.235384 21.908902 20.371549 20.099751 19.798990
time 18.920888 23.194827 17.117243 19.849433 19.235384 21.071308
tuesday 14.798649 21.236761 17.088007 16.155494 12.845233 18.734994 20.420578
two 26.362853 26.438608 24.939928 23.043437 26.362853 23.727621 24.677925
way 18.520259 22.427661 18.547237 21.330729 16.822604 20.322401 20.566964
went 14.352700 22.803509 13.601471 16.000000 12.489996 17.720045 17.029386
win 19.052559 22.693611 18.867962 20.322401 19.773720 17.972201 21.189620
tuesday two way went
anderson
assist
back
beat
befor
boucher
came
chanc
coach
come
craig
didnt
end
erik
final
five
four
gave
get
give
goal
good
got
great
guy
hoffman
host
injuri
just
karlsson
kyle
last
lead
left
like
made
make
mark
mike
minut
miss
net
next
nhl
night
notes
one
open
ottawa
pass
past
period
play
point
power
puck
put
right
ryan
said
saturday
save
score
scratch
season
senat
shot
start
stop
straight
team
think
three
thursday
tie
time
tuesday
two 25.922963
way 18.814888 26.870058
went 14.106736 24.879711 17.578396
win 19.183326 22.538855 19.339080 19.313208
With the distance matrix created, the next step is to cluster the terms.
# Build the hc object
hc.90 = hclust(dist_90)
#hc.95 = hclust(dist_95)
#hc.99 = hclust(dist_99)
# Plot the dendrograms
plot(hc.90)
#plot(hc.95)
#plot(hc.99)
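To make the clustering step easier to follow, here is a minimal sketch on a tiny, made-up term-by-document matrix. The counts, the object names (toy_tdm, toy_dist, toy_hc, toy_groups) and the choice of two groups are assumptions for illustration only; they are not taken from the hockey data.

```r
# Toy term-by-document counts (hypothetical values, not from the hockey data)
toy_tdm <- matrix(
  c(5, 4, 6, 5,   # goal
    4, 5, 5, 6,   # score
    0, 1, 0, 1,   # injuri
    1, 0, 1, 0),  # scratch
  nrow = 4, byrow = TRUE,
  dimnames = list(c("goal", "score", "injuri", "scratch"),
                  paste0("doc", 1:4))
)

# Euclidean distance between term rows (the default for dist)
toy_dist <- dist(toy_tdm)

# Agglomerative clustering; complete linkage is hclust's default method
toy_hc <- hclust(toy_dist)

# Cut the tree into two groups and list the terms in each group
toy_groups <- cutree(toy_hc, k = 2)
split(names(toy_groups), toy_groups)
```

Terms whose counts rise and fall together across documents sit close together in the distance matrix, so they are merged early and end up in the same group when the tree is cut.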
library(dendextend)
It's possible to improve the visualization by highlighting certain parts of the dendrogram.
# Build hcd
hcd.90 = as.dendrogram(hc.90)
#hcd.95 = as.dendrogram(hc.95)
#hcd.99 = as.dendrogram(hc.99)
# labels
labels(hcd.90)
#labels(hcd.95)
#labels(hcd.99)
hcd.90 = branches_attr_by_labels(hcd.90,c("boucher"),"darkred")
plot(hcd.90, main="AP Game Recaps - Dendrogram")
rect.dendrogram(hcd.90,k=10,border="darkred")
The term document matrix can also be used to find terms that are associated with each other (i.e., that tend to appear together) across documents. Here we look at which terms are associated with the term 'karlsson', using a minimum correlation of 0.33, and visualize the result as a dot plot.
associations_EK <- findAssocs(AP.recaps_tdm, "karlsson", 0.33)
associations_EK
associations_EK.df <- list_vect2df(associations_EK)[,2:3] # requires qdap
attributes(associations_EK)
library(ggplot2)
ggplot(associations_EK.df,aes(y=associations_EK.df[,1])) +
geom_point(aes(x=associations_EK.df[,2]), data=associations_EK.df, size = 3)
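As we understand it, findAssocs reports the correlation between the chosen term's counts and every other term's counts across documents, keeping only those at or above the given limit. The sketch below reproduces that idea in base R on a made-up matrix; the counts and the names toy_tdm, target, assoc and kept are hypothetical, and this is not the package's actual implementation.

```r
# Toy term-by-document counts (hypothetical values)
toy_tdm <- matrix(
  c(3, 0, 2, 0, 4,   # karlsson
    2, 0, 1, 0, 3,   # erik
    0, 2, 0, 3, 0,   # anderson
    1, 1, 1, 1, 2),  # goal
  nrow = 4, byrow = TRUE,
  dimnames = list(c("karlsson", "erik", "anderson", "goal"),
                  paste0("doc", 1:5))
)

# Pearson correlation of each other term's counts with the target term's counts
target <- "karlsson"
others <- setdiff(rownames(toy_tdm), target)
assoc <- sapply(others, function(term) cor(toy_tdm[target, ], toy_tdm[term, ]))

# Keep terms at or above the threshold, strongest first
corlimit <- 0.33
kept <- sort(assoc[assoc >= corlimit], decreasing = TRUE)
kept
```

In this toy example 'erik' and 'goal' pass the threshold, while 'anderson', which appears only in documents where 'karlsson' does not, is negatively correlated and is dropped.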
So far we have clustered terms. Documents can be clustered as well, using the document term matrix (DTM) instead of the TDM.
Other frequency weights in the DTM or TDM:
# Redo the previous exercise without removing "Ottawa" or "Senators"
# Create a function to clean the corpus
clean_corpus_hockey <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stemDocument)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "game", "first", "second", "third"))
  return(corpus)
}
# Apply the customized function to AP.recaps.corpus: clean_corp.AP.recaps.1
clean_corp.AP.recaps.1 <- clean_corpus_hockey(AP.recaps.corpus)
# Create a TDM from clean_corp.AP.recaps.1
AP.recaps.tf_tdm.1 <- TermDocumentMatrix(clean_corp.AP.recaps.1)
# Create a tf-idf DTM from clean_corp.AP.recaps.1
AP.recaps.tfidf_dtm.1 = DocumentTermMatrix(clean_corp.AP.recaps.1, control=list(weighting=weightTfIdf))
# Remove sparse terms
AP.recaps.tf_tdm.1 <- removeSparseTerms(AP.recaps.tf_tdm.1,sparse=.7)
# Apply tf-idf weighting
AP.recaps.tfidf_tdm.1 = TermDocumentMatrix(clean_corp.AP.recaps.1, control=list(weighting=weightTfIdf))
# Remove sparse terms
AP.recaps.tfidf_tdm.1 <- removeSparseTerms(AP.recaps.tfidf_tdm.1,sparse=.7)
AP.recaps.tf_tdm.1_m = as.matrix(AP.recaps.tf_tdm.1)
AP.recaps.tfidf_tdm.1_m = as.matrix(AP.recaps.tfidf_tdm.1)
At this point we have two new term document matrices, one based on term frequency (tf) and one based on term frequency-inverse document frequency (tf-idf).
head(AP.recaps.tf_tdm.1_m)
head(AP.recaps.tfidf_tdm.1_m)
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ⋯ | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| advantag | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 2 | 0 |
| also | 2 | 0 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | ⋯ | 4 | 1 | 1 | 1 | 0 | 2 | 0 | 1 | 2 | 0 |
| anderson | 3 | 3 | 0 | 2 | 3 | 5 | 1 | 13 | 6 | 1 | ⋯ | 4 | 3 | 5 | 4 | 5 | 2 | 2 | 2 | 10 | 6 |
| assist | 1 | 0 | 1 | 2 | 0 | 2 | 2 | 0 | 2 | 0 | ⋯ | 4 | 1 | 2 | 1 | 1 | 0 | 2 | 4 | 0 | 2 |
| away | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ⋯ | 0 | 1 | 0 | 4 | 1 | 1 | 1 | 1 | 1 | 0 |
| back | 0 | 3 | 0 | 1 | 0 | 0 | 2 | 1 | 1 | 0 | ⋯ | 4 | 5 | 1 | 1 | 3 | 2 | 1 | 1 | 4 | 1 |
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ⋯ | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| advantag | 0.000000000 | 0.004900412 | 0.000000000 | 0.000000000 | 0.005726324 | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | ⋯ | 0.000000000 | 0.000000000 | 0.004012936 | 0.003367684 | 0.000000000 | 0.003443533 | 0.000000000 | 0.000000000 | 0.007127872 | 0.000000000 |
| also | 0.002133881 | 0.000000000 | 0.002188724 | 0.002133881 | 0.001754258 | 0.001117869 | 0.001496444 | 0.001598590 | 0.003357612 | 0.001334436 | ⋯ | 0.004503720 | 0.001040860 | 0.001229362 | 0.001031689 | 0.000000000 | 0.002109851 | 0.000000000 | 0.001128643 | 0.002183622 | 0.000000000 |
| anderson | 0.003336810 | 0.004695062 | 0.000000000 | 0.002224540 | 0.005486365 | 0.005826808 | 0.001560021 | 0.021664588 | 0.010500785 | 0.001391130 | ⋯ | 0.004695062 | 0.003255243 | 0.006407959 | 0.004302084 | 0.005548710 | 0.002199489 | 0.002287056 | 0.002353188 | 0.011381969 | 0.006382830 |
| assist | 0.001657117 | 0.000000000 | 0.003399412 | 0.003314233 | 0.000000000 | 0.003472430 | 0.004648397 | 0.000000000 | 0.005214868 | 0.000000000 | ⋯ | 0.006994944 | 0.001616609 | 0.003818762 | 0.001602366 | 0.001653350 | 0.000000000 | 0.003407373 | 0.007011799 | 0.000000000 | 0.003169822 |
| away | 0.000000000 | 0.004900412 | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | 0.005218186 | 0.000000000 | 0.000000000 | ⋯ | 0.000000000 | 0.003397619 | 0.000000000 | 0.013470735 | 0.003474837 | 0.003443533 | 0.003580629 | 0.003684165 | 0.003563936 | 0.000000000 |
| back | 0.000000000 | 0.005487968 | 0.000000000 | 0.001300111 | 0.000000000 | 0.000000000 | 0.003646956 | 0.001947948 | 0.002045694 | 0.000000000 | ⋯ | 0.005487968 | 0.006341652 | 0.001498028 | 0.001257156 | 0.003891468 | 0.002570940 | 0.001336648 | 0.001375298 | 0.005321666 | 0.001243461 |
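To see where small tf-idf values like those above come from, here is a worked sketch in base R. It assumes the weighting divides each count by the total count of its document and multiplies by log2 of (number of documents / number of documents containing the term), which is our understanding of weightTfIdf with its default normalization; check the tm documentation before relying on it. The counts and object names are made up.

```r
# Toy term-by-document counts (hypothetical values)
toy_tf <- matrix(
  c(2, 1, 3,   # score: appears in every document
    1, 0, 0,   # injuri: appears in one document only
    1, 3, 1),  # goal: appears in every document
  nrow = 3, byrow = TRUE,
  dimnames = list(c("score", "injuri", "goal"), paste0("doc", 1:3))
)

n_docs <- ncol(toy_tf)

# Normalized term frequency: each count divided by its document's total count
tf_norm <- sweep(toy_tf, 2, colSums(toy_tf), "/")

# Inverse document frequency: log2(number of docs / docs containing the term)
doc_freq <- rowSums(toy_tf > 0)
idf <- log2(n_docs / doc_freq)

# tf-idf: each term's row is scaled by that term's idf
toy_tfidf <- tf_norm * idf
round(toy_tfidf, 3)
```

Under this definition a term that occurs in every document gets an idf of zero, so its tf-idf weight vanishes no matter how often it is used, while a rare term such as 'injuri' keeps a positive weight.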
Now that we've created the term document matrices using the different frequency weights discussed above, we can cluster the terms as before. This time we apply k-means clustering. Unlike hierarchical clustering, k-means requires us to specify the number of clusters in advance.
AP.tf_tdm.df <- data.frame(AP.recaps.tf_tdm.1_m)
AP.tfidf_tdm.df <- data.frame(AP.recaps.tfidf_tdm.1_m)
#kmeans requires us to specify number of clusters - in this case, 5
#we also generate the clusters with and without scaling (ns)
AP.clusters.tf = kmeans(scale(AP.tf_tdm.df),5,nstart=20)
AP.clusters.tfidf = kmeans(scale(AP.tfidf_tdm.df),5,nstart=20)
AP.clusters.tf_ns = kmeans((AP.tf_tdm.df),5,nstart=20)
AP.clusters.tfidf_ns = kmeans((AP.tfidf_tdm.df),5,nstart=20)
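The effect of scaling is easiest to see on a small example. The sketch below uses made-up data with one small-valued and one large-valued column; the data frame toy, the seed, and the choice of two clusters are assumptions for illustration. Because k-means starts from random centers, we set a seed so the run is repeatable.

```r
set.seed(42)  # k-means starts from random centers; fix the seed for repeatability

# Toy data: two features on very different scales (hypothetical values)
toy <- data.frame(
  small = c(1.0, 1.2, 0.9, 5.0, 5.1, 4.8),
  large = c(1000, 3000, 2000, 1100, 2900, 2100)
)
rownames(toy) <- paste0("item", 1:6)

# Without scaling, distances are driven almost entirely by the large column
km_ns <- kmeans(toy, centers = 2, nstart = 20)

# With scaling, each column has mean 0 and standard deviation 1,
# so both contribute comparably to the distances
km_sc <- kmeans(scale(toy), centers = 2, nstart = 20)

# Compare the cluster assignments of the two runs
km_ns$cluster
km_sc$cluster
```

The nstart argument reruns the algorithm from several random starts and keeps the best solution, which is why it is also set to 20 in the calls above.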
For each of the four variations on the clustering strategy, we can inspect both the size and the contents of each of the five clusters, shown below.
AP.clusters.tf$size
clusters.tf <- AP.clusters.tf$cluster
(terms.tf.1 <- gsub(".txt","",names(which(clusters.tf==1))))
(terms.tf.2 <- gsub(".txt","",names(which(clusters.tf==2))))
(terms.tf.3 <- gsub(".txt","",names(which(clusters.tf==3))))
(terms.tf.4 <- gsub(".txt","",names(which(clusters.tf==4))))
(terms.tf.5 <- gsub(".txt","",names(which(clusters.tf==5))))
AP.clusters.tfidf$size
clusters.tfidf <- AP.clusters.tfidf$cluster
(terms.tfidf.1 <- gsub(".txt","",names(which(clusters.tfidf==1))))
(terms.tfidf.2 <- gsub(".txt","",names(which(clusters.tfidf==2))))
(terms.tfidf.3 <- gsub(".txt","",names(which(clusters.tfidf==3))))
(terms.tfidf.4 <- gsub(".txt","",names(which(clusters.tfidf==4))))
(terms.tfidf.5 <- gsub(".txt","",names(which(clusters.tfidf==5))))
AP.clusters.tf_ns$size
clusters.tf_ns <- AP.clusters.tf_ns$cluster
(terms.tf_ns.1 <- gsub(".txt","",names(which(clusters.tf_ns==1))))
(terms.tf_ns.2 <- gsub(".txt","",names(which(clusters.tf_ns==2))))
(terms.tf_ns.3 <- gsub(".txt","",names(which(clusters.tf_ns==3))))
(terms.tf_ns.4 <- gsub(".txt","",names(which(clusters.tf_ns==4))))
(terms.tf_ns.5 <- gsub(".txt","",names(which(clusters.tf_ns==5))))
AP.clusters.tfidf_ns$size
clusters.tfidf_ns <- AP.clusters.tfidf_ns$cluster
(terms.tfidf_ns.1 <- gsub(".txt","",names(which(clusters.tfidf_ns==1))))
(terms.tfidf_ns.2 <- gsub(".txt","",names(which(clusters.tfidf_ns==2))))
(terms.tfidf_ns.3 <- gsub(".txt","",names(which(clusters.tfidf_ns==3))))
(terms.tfidf_ns.4 <- gsub(".txt","",names(which(clusters.tfidf_ns==4))))
(terms.tfidf_ns.5 <- gsub(".txt","",names(which(clusters.tfidf_ns==5))))
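The five near-identical lines per variation can be condensed with split, which groups names by cluster number in a single call. The sketch below uses a made-up assignment vector shaped like the cluster element of a kmeans result; toy_clusters and members are hypothetical names.

```r
# Hypothetical cluster assignment, shaped like the $cluster element of a kmeans result
toy_clusters <- c(goal = 1, score = 1, injuri = 2, scratch = 2, ottawa = 3)

# One list element per cluster, each holding the names of its members
members <- split(names(toy_clusters), toy_clusters)
members

# Number of members per cluster, comparable to the $size element
lengths(members)
```

Applied to the real results, something like split(names(clusters.tf), clusters.tf) would return all five term lists at once.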
# Cluster documents (rather than terms) using the tf-idf DTM, again with 5 clusters
AP.clusters.tfidf.dtm <- kmeans(data.frame(as.matrix(AP.recaps.tfidf_dtm.1)),5,nstart=20)
AP.clusters.tfidf.dtm$size
clusters.AP <- AP.clusters.tfidf.dtm$cluster
(docs.tfidf.1 <- gsub(".txt","",names(which(clusters.AP==1))))
(docs.tfidf.2 <- gsub(".txt","",names(which(clusters.AP==2))))
(docs.tfidf.3 <- gsub(".txt","",names(which(clusters.AP==3))))
(docs.tfidf.4 <- gsub(".txt","",names(which(clusters.AP==4))))
(docs.tfidf.5 <- gsub(".txt","",names(which(clusters.AP==5))))